Fast and Accurate Language Detection in Short Texts using Contextual Entropy

Authors

  • Edgar Chávez
  • Moises Garcia
  • Jesús Favela
Abstract

In this work we address the problem of language identification (LI) on short segments of text. The central idea is to compute the entropy of a document in different contexts and assign it to the category where the entropy is maximal. Only word distributions are needed for the task; no other training is done. For LI, the contexts are the languages, and classification is done by simply evaluating the high-order entropy of the text. Our results show that the language of a short text, a challenging case, can be identified accurately, matching state-of-the-art approaches reported in the literature. Because of its simplicity, our method is also fast and easy to code, and it needs no training beyond estimating the word distribution of each language when one is not already available.
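The abstract does not spell out the entropy formula, so the following is only a minimal sketch of the idea under stated assumptions. It assumes the entropy of a text in the context of a language is the sum of -p(w) log2 p(w) over the words w of the text, where p(w) is that language's word probability and unseen words contribute nothing. The function names, the toy corpora, and the whitespace tokenization are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def word_distribution(corpus):
    """Estimate relative word frequencies from a sample corpus (assumed helper)."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def contextual_entropy(text, dist):
    """Sum -p * log2(p) over the tokens of `text`, with p taken from `dist`.

    Assumption: tokens missing from `dist` contribute zero, so a text
    scores highest in the language whose vocabulary it shares.
    """
    score = 0.0
    for word in text.lower().split():
        p = dist.get(word, 0.0)
        if p > 0.0:
            score -= p * math.log2(p)
    return score


def identify_language(text, dists):
    """Return the language (context) in which the text's entropy is maximal."""
    return max(dists, key=lambda lang: contextual_entropy(text, dists[lang]))


# Toy corpora for illustration only; real use needs proper per-language word lists.
CORPORA = {
    "en": "the cat is on the table and the dog is in the house",
    "es": "el gato está en la mesa y el perro está en la casa",
}
DISTS = {lang: word_distribution(corpus) for lang, corpus in CORPORA.items()}

if __name__ == "__main__":
    print(identify_language("the dog is on the table", DISTS))
    print(identify_language("el perro está en la mesa", DISTS))
```

The only per-language state in this sketch is a word-frequency table, which matches the abstract's claim that nothing beyond word distributions has to be estimated. A text with no known words scores zero in every language, so a real system would need an explicit tie-breaking rule.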

Similar resources

The Impact of Input Enrichment in Long Text vs. Short Texts on Grammatical Accuracy in Writing Among Elementary Language Learners

This study was conducted to investigate the influence of teaching accurate grammar in writing via enriched long text and short text for the elementary students at Shokouhe_Farhang institute. The homogenized subjects were divided into two groups of 18 and 17 participants. A writing exam was used as a pretest to check the students’ knowledge of the English past tense. The control group received the...

A Novel Method for Detection of Epilepsy in Short and Noisy EEG Signals Using Ordinal Pattern Analysis

Introduction: In this paper, a novel complexity measure is proposed to detect dynamical changes in nonlinear systems using ordinal pattern analysis of time series data taken from the system. Epilepsy is considered as a dynamical change in nonlinear and complex brain system. The ability of the proposed measure for characterizing the normal and epileptic EEG signals when the signal is short or is...

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...

The Comparative Effects of Using Electronic Short Story Books and Traditional Printed Texts on EFL Learners’ Reading Comprehension

The purpose of this study was to investigate the comparative effect of using electronic short story books and traditional printed texts on EFL learners’ reading comprehension. For that purpose, ninety female learners between fifteen and thirty-five years of age sat for the language proficiency test (PET, 2009) as the test of homogeneity, and consequently sixty students were selected based on t...

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in high academic institutions. There exist language-free tools which do not yield any reliable results, since the special features of every language are ignored in them. Considering the paucit...

Journal:
  • Research in Computing Science

Volume 90  Issue 

Pages  -

Publication date 2015